DataBase-ball

Course DS 250

Author

Spencer Driggs

Elevator pitch

Despite what all the anti-“sportsball” folk would say, interest in sports across the country continues to grow. Baseball in particular is growing in international appeal, and the Olympic gold medal has not only gone to the United States but also Cuba, South Korea, and Japan in the last 6 Olympics. The more we can understand the trends of growth in American baseball, the better we can recognize the trends going on internationally right now. This baseball database will help us do that!

Grand Question 1

Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

BYUI Players
# Include and execute your code here
batting = pd.read_sql_query('''
SELECT CollegePlaying.playerID, 
schoolID, salary, CollegePlaying.yearID, teamID FROM CollegePlaying LEFT JOIN Salaries USING (playerID) WHERE schoolID="idbyuid" ORDER BY salary desc''', con)
Markdown(batting.to_markdown(index=False))
playerID schoolID salary yearID teamID
lindsma01 idbyuid 4e+06 2001 CHA
lindsma01 idbyuid 4e+06 2002 CHA
lindsma01 idbyuid 3.6e+06 2001 BAL
lindsma01 idbyuid 3.6e+06 2002 BAL
lindsma01 idbyuid 2.8e+06 2001 COL
lindsma01 idbyuid 2.8e+06 2002 COL
lindsma01 idbyuid 2.3e+06 2001 CHA
lindsma01 idbyuid 2.3e+06 2002 CHA
lindsma01 idbyuid 1.625e+06 2001 HOU
lindsma01 idbyuid 1.625e+06 2002 HOU
stephga01 idbyuid 1.025e+06 1991 SLN
stephga01 idbyuid 1.025e+06 1992 SLN
stephga01 idbyuid 900000 1991 SLN
stephga01 idbyuid 900000 1992 SLN
stephga01 idbyuid 800000 1991 SLN
stephga01 idbyuid 800000 1992 SLN
stephga01 idbyuid 550000 1991 SLN
stephga01 idbyuid 550000 1992 SLN
lindsma01 idbyuid 410000 2001 FLO
lindsma01 idbyuid 410000 2002 FLO
lindsma01 idbyuid 395000 2001 FLO
lindsma01 idbyuid 395000 2002 FLO
lindsma01 idbyuid 380000 2001 FLO
lindsma01 idbyuid 380000 2002 FLO
stephga01 idbyuid 215000 1991 SLN
stephga01 idbyuid 215000 1992 SLN
stephga01 idbyuid 185000 1991 PHI
stephga01 idbyuid 185000 1992 PHI
stephga01 idbyuid 150000 1991 PHI
stephga01 idbyuid 150000 1992 PHI
catetr01 idbyuid nan 2002

There aren’t many professional baseball players from BYU-Idaho, which isn’t too surprising given its past Division III status. I thought about filtering out Mr. Catetr, but with so few to even make it big I figured we’d let him take the credit.

Grand Question 2

A) Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

plot example
# Include and execute your code here
batting1 = pd.read_sql_query('''
SELECT playerId, 
    yearID, H, AB, 
    (H+0.0)/(AB+0.0) AS "Batting Average" 
FROM Batting 
WHERE AB >= 1 
ORDER BY H/AB desc, playerID
LIMIT 5 ''', con)
Markdown(batting1.to_markdown(index=False))
playerID yearID H AB Batting Average
aberal01 1957 1 1 1
abernte02 1960 1 1 1
abramge01 1923 1 1 1
acklefr01 1964 1 1 1
alanirj01 2019 1 1 1

With so many players in the database, it’s not surprising that we have a few that happened to get on base in their only hit of the season.

B) Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.

table example
# Include and execute your code here
batting2 = pd.read_sql_query('''
SELECT playerId, 
    yearID, 
    (H+0.0) / (AB+0.0) AS "BattingAverage" 
FROM Batting 
WHERE AB >= 10 
ORDER BY H/AB desc, playerID 
LIMIT 5''', con)
Markdown(batting2.to_markdown(index=False))
playerID yearID BattingAverage
aaronha01 1954 0.279915
aaronha01 1955 0.313953
aaronha01 1956 0.328407
aaronha01 1957 0.321951
aaronha01 1958 0.326123

C) Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.

Show the code
batting3 = pd.read_sql_query('''
SELECT playerId,  
    sum(H+0.0) / sum(AB) AS "Batting Average" 
FROM Batting 
GROUP BY playerId
HAVING sum(AB) >= 100
ORDER BY "Batting Average" desc, playerID 
LIMIT 5''', con)
Markdown(batting3.to_markdown(index=False))
playerID Batting Average
cobbty01 0.366299
barnero01 0.359682
hornsro01 0.358497
jacksjo01 0.355752
meyerle01 0.355509

Grand Question 3

Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph in Altair to visualize the comparison. What do you learn?

I’ve chosen to compare the season-by-season hits of the Chicago Cubs against the Seattle Mariners. I picked the Cubs because I grew up in Chicago and the Cubs are the only team I know about from Chicago, and the Seattle Mariners in honor of the greatest chart-driven documentary series of all time:

History of the Seattle Mariners

Each team is in its main color, with black circles representing the times they won their division and made it to the playoffs

table example
batting4 = pd.read_sql_query('''SELECT name, yearID, H, DivWin
FROM teams 
WHERE name == "Seattle Mariners" 
    OR name == "Chicago Cubs" AND yearID > 1978 
''',con)

MarinersTeam = (alt.Chart(batting4[batting4.name=='Seattle Mariners'])
.encode(x=alt.X('yearID', axis=alt.Axis(title='Year', format='d')),
y = alt.Y('H',axis=alt.Axis(title='How many Hits?'),scale=alt.Scale(domain=(600,1800))))
.properties(title='How do they do?')
.mark_line(color='#005C5C')
)

Playoff = (alt.Chart(batting4[batting4.DivWin == 'Y'])
.encode(x=alt.X('yearID', axis=alt.Axis(title='Year', format='d')),
y = alt.Y('H',axis=alt.Axis(title='How many Hits?'),scale=alt.Scale(domain=(600,1800))))
.properties(title='How do they do?')
.mark_circle(color='black')
)

CubsTeam = (alt.Chart(batting4[batting4.name=='Chicago Cubs'])
.encode(x=alt.X('yearID', axis=alt.Axis(title='Year', format='d')),
y = alt.Y('H',axis=alt.Axis(title='How many Hits per Season?'),scale=alt.Scale(domain=(600, 1800))))
.properties(title='Both Teams are Middle of the Road')
.mark_line(color='red')
)

MarinersTeam + CubsTeam + Playoff

The big drops in 1981 and 1994 are the result of player strikes against the owners. Less games were played that season, which means all teams had less opportunity to get hits. Compared to previous seasons they look really bad, but compared to other teams they were about average. Outside of those strike years, there isn’t much connection between the two teams’s hits over the years. We can judge by the lack of division wins that they were both pretty mediocre through most of the 2010s, and the declining number of hits seems to match it as well.